Goto

Collaborating Authors

 best split


Modifying Final Splits of Classification Tree for Fine-tuning Subpopulation Target in Policy Making

arXiv.org Machine Learning

Policymakers often use Classification and Regression Trees (CART) to partition populations based on binary outcomes and target subpopulations whose probability of the binary event exceeds a threshold. However, classic CART and knowledge distillation method whose student model is a CART (referred to as KD-CART) do not minimize the misclassification risk associated with classifying the latent probabilities of these binary events. To reduce the misclassification risk, we propose two methods, Penalized Final Split (PFS) and Maximizing Distance Final Split (MDFS). PFS incorporates a tunable penalty into the standard CART splitting criterion function. MDFS maximizes a weighted sum of distances between node means and the threshold. It can point-identify the optimal split under the unique intersect latent probability assumption. In addition, we develop theoretical result for MDFS splitting rule estimation, which has zero asymptotic risk. Through extensive simulation studies, we demonstrate that these methods predominately outperform classic CART and KD-CART in terms of misclassification error. Furthermore, in our empirical evaluations, these methods provide deeper insights than the two baseline methods.


Online Gradient Boosting Decision Tree: In-Place Updates for Efficient Adding/Deleting Data

arXiv.org Machine Learning

Gradient Boosting Decision Tree (GBDT) is one of the most popular machine learning models in various applications. However, in the traditional settings, all data should be simultaneously accessed in the training procedure: it does not allow to add or delete any data instances after training. In this paper, we propose an efficient online learning framework for GBDT supporting both incremental and decremental learning. To the best of our knowledge, this is the first work that considers an in-place unified incremental and decremental learning on GBDT. To reduce the learning cost, we present a collection of optimizations for our framework, so that it can add or delete a small fraction of data on the fly. We theoretically show the relationship between the hyper-parameters of the proposed optimizations, which enables trading off accuracy and cost on incremental and decremental learning. The backdoor attack results show that our framework can successfully inject and remove backdoor in a well-trained model using incremental and decremental learning, and the empirical results on public datasets confirm the effectiveness and efficiency of our proposed online learning framework and optimizations.


DynFrs: An Efficient Framework for Machine Unlearning in Random Forest

arXiv.org Artificial Intelligence

Random Forests are widely recognized for establishing efficacy in classification and regression tasks, standing out in various domains such as medical diagnosis, finance, and personalized recommendations. These domains, however, are inherently sensitive to privacy concerns, as personal and confidential data are involved. With increasing demand for the right to be forgotten, particularly under regulations such as GDPR and CCPA, the ability to perform machine unlearning has become crucial for Random Forests. However, insufficient attention was paid to this topic, and existing approaches face difficulties in being applied to real-world scenarios. Addressing this gap, we propose the DynFrs framework designed to enable efficient machine unlearning in Random Forests while preserving predictive accuracy. Dynfrs leverages subsampling method Occ(q) and a lazy tag strategy Lzy, and is still adaptable to any Random Forest variant. In essence, Occ(q) ensures that each sample in the training set occurs only in a proportion of trees so that the impact of deleting samples is limited, and Lzy delays the reconstruction of a tree node until necessary, thereby avoiding unnecessary modifications on tree structures. In experiments, applying Dynfrs on Extremely Randomized Trees yields substantial improvements, achieving orders of magnitude faster unlearning performance and better predictive accuracy than existing machine unlearning methods for Random Forests.


Timber! Poisoning Decision Trees

arXiv.org Machine Learning

We present Timber, the first white-box poisoning attack targeting decision trees. Timber is based on a greedy attack strategy leveraging sub-tree retraining to efficiently estimate the damage performed by poisoning a given training instance. The attack relies on a tree annotation procedure which enables sorting training instances so that they are processed in increasing order of computational cost of sub-tree retraining. This sorting yields a variant of Timber supporting an early stopping criterion designed to make poisoning attacks more efficient and feasible on larger datasets. We also discuss an extension of Timber to traditional random forest models, which is useful because decision trees are normally combined into ensembles to improve their predictive power. Our experimental evaluation on public datasets shows that our attacks outperform existing baselines in terms of effectiveness, efficiency or both. Moreover, we show that two representative defenses can mitigate the effect of our attacks, but fail at effectively thwarting them.


Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

arXiv.org Artificial Intelligence

Gradient Boosting Decision Tree (GBDT) has achieved remarkable success in a wide variety of applications. The split finding algorithm, which determines the tree construction process, is one of the most crucial components of GBDT. However, the split finding algorithm has long been criticized for its bias towards features with a large number of potential splits. This bias introduces severe interpretability and overfitting issues in GBDT. To this end, we provide a fine-grained analysis of bias in GBDT and demonstrate that the bias originates from 1) the systematic bias in the gain estimation of each split and 2) the bias in the split finding algorithm resulting from the use of the same data to evaluate the split improvement and determine the best split. Based on the analysis, we propose unbiased gain, a new unbiased measurement of gain importance using out-of-bag samples. Moreover, we incorporate the unbiased property into the split finding algorithm and develop UnbiasedGBM to solve the overfitting issue of GBDT. We assess the performance of UnbiasedGBM and unbiased gain in a large-scale empirical study comprising 60 datasets and show that: 1) UnbiasedGBM exhibits better performance than popular GBDT implementations such as LightGBM, XGBoost, and Catboost on average on the 60 datasets and 2) unbiased gain achieves better average performance in feature selection than popular feature importance methods. The codes are available at https://github.com/ZheyuAqaZhang/UnbiasedGBM.


Cheat-Sheet: Decision Trees Terminology

#artificialintelligence

Now, that we know the basic building blocks of a decision tree, we need to know how to grow one. Creating a decision tree describes the process of dividing the input space into several distinct, non-overlapping sub-spaces. In order to divide the input space, we have to test all features and threshold values to find the optimal split that minimizes our cost function. Once we obtain the best split, we can continue to grow our tree recursively. The process is termed recursive since each sub-space may be split an indefinite number of times until a stopping criterion (e.g.


Decision Trees Explained

#artificialintelligence

In this post, I will explain Decision Trees in simple terms. It could be considered a Decision Trees for dummies post, however, I've never really liked that expression. In the Machine Learning world, Decision Trees are a kind of non parametric models, that can be used for both classification and regression. This means that Decision trees are flexible models that don't increase their number of parameters as we add more features (if we build them correctly), and they can either output a categorical prediction (like if a plant is of a certain kind or not) or a numerical prediction (like the price of a house). They are constructed using two kinds of elements: nodes and branches.


An Experimental Evaluation of Large Scale GBDT Systems

arXiv.org Machine Learning

Gradient boosting decision tree (GBDT) is a widely-used machine learning algorithm in both data analytic competitions and real-world industrial applications. Further, driven by the rapid increase in data volume, efforts have been made to train GBDT in a distributed setting to support large-scale workloads. However, we find it surprising that the existing systems manage the training dataset in different ways, but none of them have studied the impact of data management. To that end, this paper aims to study the pros and cons of different data management methods regarding the performance of distributed GBDT. We first introduce a quadrant categorization of data management policies based on data partitioning and data storage. Then we conduct an in-depth systematic analysis and summarize the advantageous scenarios of the quadrants. Based on the analysis, we further propose a novel distributed GBDT system named Vero, which adopts the unexplored composition of vertical partitioning and row-store and suits for many large-scale cases. To validate our analysis empirically, we implement different quadrants in the same code base and compare them under extensive workloads, and finally compare Vero with other state-of-the-art systems over a wide range of datasets. Our theoretical and experimental results provide a guideline on choosing a proper data management policy for a given workload.


Machine Learning Basics - Random Forest

#artificialintelligence

RF is based on decision trees. In machine learning decision trees are a technique for creating predictive models. They are called decision trees because the prediction follows several branches of "if… then…" decision splits - similar to the branches of a tree. If we imagine that we start with a sample, which we want to predict a class for, we would start at the bottom of a tree and travel up the trunk until we come to the first split-off branch. This split can be thought of as a feature in machine learning, let's say it would be "age"; we would now make a decision about which branch to follow: "if our sample has an age bigger than 30, continue along the left branch, else continue along the right branch".


Learn ML Algorithms by coding: Decision Trees – Lethal Brains

#artificialintelligence

Let us build a crude decision tree which predicts the outcome in probabilities (In Scikit learn, predict method returns the predicted classes while the predict_proba method returns the predicted probabilities. What do you think would be most simple and easy way to predict the probabilities? I have touched it up a little bit. The fit method accepts a dataframe(data) and a string for the target attribute(target). Both of the them are then assigned to the object.